Assessing Fit of the Lognormal Model for Response Times


Abstract 

Response-time models are of increasing interest in educational and psychological testing. 
This paper focuses on the lognormal model for response times (van der Linden, 2006), 
which is one of the most popular response-time models. Several existing statistics for 
testing normality and the fit of factor-analysis models are repurposed for testing the fit of 
the lognormal model. A simulation study and two real data examples demonstrate the 
usefulness of the statistics. The Shapiro-Wilk test of normality (Shapiro & Wilk, 1965) 
and a Z-test for factor analysis models (Maydeu-Olivares, 2017) were the most powerful in
assessing the misfit of the lognormal model.

With the increasing popularity of computerized testing, which makes recording of
response times straightforward, analysis of response times has become a rapidly expanding 
field of research. A common way to analyze response times is to include them in 
psychometric/statistical models that are referred to as response-time models (RTMs). The 
use of RTMs has been suggested to improve precision of examinee ability estimates (e.g.,
Bolsinova & Tijmstra, 2018; van der Linden, Klein Entink, & Fox, 2010), to detect test 
fraud (e.g., Qian, Staniewska, Reckase, & Woo, 2016; van der Linden & Guo, 2008; Sinharay 
& Johnson, 2019), to detect speededness (e.g., Schnipke & Scrams, 1997), to improve test 
construction (e.g., van der Linden, 2007), and to test substantive theories about cognitive
processes (e.g., van der Maas, Molenaar, Maris, Kievit, & Borsboom, 2011). Several RTMs 
have been suggested by, for example, Bolsinova and Tijmstra (2018), Klein Entink, Fox, 
and van der Linden (2009), Klein Entink, van der Linden, and Fox (2009), Maris (1993), 
Maris and van der Maas (2012), Rasch (1960), Schnipke and Scrams (1997), Thissen (1983), 
van der Linden (2006), van der Linden (2007), van der Maas et al. (2011), and Wang and 
Hanson (2005). Extensive reviews of RTMs include De Boeck and Jeon (2019), Kyllonen 
and Zu (2016), Lee and Chen (2011), Schnipke and Scrams (2002), van der Linden (2009), 
and van Rijn and Ali (2017). 

The lognormal model for response times (LNMRT) is arguably one of the most popular 
RTMs. The model was first suggested by Thissen (1983), was further developed by van der 
Linden (2006), and has been considered, either to analyze only the response times, or to 
jointly analyze the response times and response accuracies, by several researchers including 
Bolsinova and Tijmstra (2018), Boughton, Smith, and Ren (2017), Glas and van der Linden 
(2010), Qian et al. (2016), Sinharay (2018), Sinharay and Johnson (2019), van der Linden 
(2007), van der Linden (2009), van der Linden and Glas (2010), and van der Linden and 
Guo (2008). 

There is a lack of research on model-fit statistics for the LNMRT, with Ranger and Kuhn
(2014), Glas and van der Linden (2010), and van der Linden and Glas (2010) being among
the few exceptions. This paper, in an attempt to fill that void, brings to bear several tools
that have been used to assess fit of other statistical models to test item fit and the local
independence assumption for the LNMRT.

The next section includes a review of the LNMRT, existing approaches for estimation 
of the parameters of the model, and existing approaches for the assessment of fit of the 
model. The Methods section includes discussions of the model-fit statistics that we propose 
for assessing the fit of the LNMRT. The Simulations section includes an evaluation of 
the Type I error rate and the power of the statistics. The Real Data section includes 
applications of the statistics to two operational data sets. Discussions and conclusions are
provided in the last section.


Reviews of the Lognormal Model, Fit Statistics, and Normality Tests 
The Lognormal Response Time Model 
The Model 


Let us consider a test that includes $J$ items. Let $t_{ij}$ denote the response time, which is
typically defined as the time an examinee spends on an item in a test, of examinee $i$ on item
$j$, where $i = 1, 2, \ldots, I$, $j = 1, 2, \ldots, J$, and $I$ is the number of examinees. Let us define
$y_{ij} = \log(t_{ij})$. According to the LNMRT, the $y_{ij}$’s, $j = 1, 2, \ldots, J$, are independent
conditional on $\tau_i$ for any $i$ and

$$y_{ij} \mid \tau_i \sim \mathcal{N}\left(\beta_j - \tau_i, \frac{1}{\alpha_j^2}\right), \qquad (1)$$

where $\mathcal{N}(\mu, \sigma^2)$ denotes the normal distribution with mean $\mu$ and variance $\sigma^2$. The
parameter $\tau_i$ is the examinee’s speed parameter; a larger value of the parameter results
in smaller expected response time on all items for the examinee. The parameter $\beta_j$
is the time-intensity parameter for item $j$; a larger value of the parameter results in
larger expected response times for all examinees. The parameter $\alpha_j$ is the discrimination
parameter for item $j$; a larger value of the parameter leads to more information on, and
hence a smaller standard error of, the examinee speed parameters. Given that the conditional
mean of $y_{ij}$ is $\beta_j - \tau_i$, one needs to impose a restriction on the model parameters to ensure
identifiability. This paper assumes that the population distribution $g(\tau_i)$ on $\tau_i$ is $\mathcal{N}(0, \sigma^2)$
for an unknown parameter $\sigma^2$, as in van der Linden and Guo (2008), which imposes the
restriction that the population mean is 0; this restriction is similar to the restriction
that the population mean of the examinee abilities is 0 in marginal maximum likelihood
estimation in item response theory (e.g., Bock & Aitkin, 1981).
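
As an illustration of Equation 1, the following minimal Python sketch draws log response times from the LNMRT. It is not the code used in this paper (the computations reported here use R); the function name `simulate_log_times` and all parameter values are hypothetical and chosen only for the example.

```python
import random

def simulate_log_times(alpha, beta, n_examinees, sigma, seed=0):
    """Draw y_ij = log(t_ij) with y_ij | tau_i ~ N(beta_j - tau_i, 1 / alpha_j^2).

    alpha, beta: item discrimination and time-intensity parameters (hypothetical values).
    sigma: standard deviation of the speed parameters tau_i, whose mean is fixed at 0.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_examinees):
        tau = rng.gauss(0.0, sigma)  # examinee speed parameter
        # The conditional standard deviation of y_ij is 1 / alpha_j.
        row = [rng.gauss(b - tau, 1.0 / a) for a, b in zip(alpha, beta)]
        data.append(row)
    return data

alpha = [1.5, 2.0, 2.5]   # hypothetical discrimination parameters
beta = [3.5, 4.0, 4.5]    # hypothetical time-intensity parameters
y = simulate_log_times(alpha, beta, n_examinees=2000, sigma=0.3)

# Because the population mean of tau_i is 0, the item means of y_ij estimate beta_j.
item_means = [sum(row[j] for row in y) / len(y) for j in range(len(beta))]
```

Exponentiating the entries of `y` gives response times on the original scale.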

In applications of the LNMRT, researchers have analyzed only response times using 
the stand-alone LNMRT (e.g., Finger & Chee, 2009; Sinharay, 2018; van der Linden, 2006) 
or jointly analyzed both response times and response accuracies using the LNMRT and 
an item-response theory (IRT) model (e.g., Glas & van der Linden, 2010; van der Linden, 
2007; van der Linden & Glas, 2010). 


Estimation of the Item Parameters of the Model 


A Markov chain Monte Carlo algorithm was suggested by van der Linden (2006) to 
estimate the parameters of the LNMRT. Glas and van der Linden (2010) suggested an 
approach to compute the maximum likelihood estimates (MLEs) of the item parameters 
when the LNMRT is used along with the three-parameter logistic model (3PLM) to jointly 
analyze both response times and response accuracy. Finger and Chee (2009) showed how 
one can use factor analysis to obtain the marginal maximum likelihood estimates (MMLE) 
of the item parameters of the stand-alone LNMRT and researchers such as Molenaar, 
Tuerlinckx, and van der Maas (2015) showed how one can use factor analysis to obtain 
the MMLEs of the item parameters of the joint model involving the LNMRT and an IRT 
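model. Under the LNMRT, $y_{ij}$ can be expressed as

$$y_{ij} = \beta_j - \tau_i + \epsilon_{ij}, \qquad (2)$$

where the $\epsilon_{ij}$’s are independent of the $\tau_i$’s, $E(\epsilon_{ij}) = 0$, and $\mathrm{Var}(\epsilon_{ij}) = \frac{1}{\alpha_j^2}$. The above equation
implies that

$$\mathbf{y}_i - \boldsymbol{\beta} = -\tau_i \mathbf{1} + \boldsymbol{\epsilon}_i, \qquad (3)$$

where $\mathbf{y}_i = (y_{i1}, y_{i2}, \ldots, y_{iJ})'$, $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_J)'$, $\mathbf{1}$ is a $J \times 1$ vector of all 1’s,
$\boldsymbol{\epsilon}_i = (\epsilon_{i1}, \epsilon_{i2}, \ldots, \epsilon_{iJ})'$, $E(\boldsymbol{\epsilon}_i) = \mathbf{0}$, and $\mathrm{Var}(\mathbf{y}_i) = \mathbf{D}$, where the $j$-th diagonal element of the
$J \times J$ matrix $\mathbf{D}$ is equal to $\sigma^2 + \frac{1}{\alpha_j^2}$ and the off-diagonal elements of $\mathbf{D}$ are equal to $\sigma^2$.
Equation 3 is like the equation of a factor-analysis model with one common factor $\tau_i$ where
all the factor loadings are restricted to be equal to 1 (e.g., Jöreskog, 1967).¹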
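
The covariance structure implied by Equation 3 can be sketched in a few lines of Python. This is an illustrative sketch only; the function name `implied_covariance` and the parameter values are assumptions made for the example, and the paper itself relies on lavaan for estimation.

```python
def implied_covariance(alpha, sigma2):
    """Model-implied J x J covariance matrix of the log response times.

    Diagonal elements: sigma2 + 1 / alpha_j^2; off-diagonal elements: sigma2.
    """
    n_items = len(alpha)
    return [
        [sigma2 + (1.0 / alpha[j] ** 2 if j == k else 0.0) for k in range(n_items)]
        for j in range(n_items)
    ]

# Hypothetical values: three items and a speed variance of 0.09.
D = implied_covariance(alpha=[2.0, 1.0, 4.0], sigma2=0.09)
```

The common off-diagonal value reflects the single factor $\tau_i$ with equal loadings, while the item-specific residual variances $1/\alpha_j^2$ appear only on the diagonal.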

Therefore, we used the R package lavaan (Rosseel, 2012), which is used to perform
factor analysis and structural equation modeling (SEM), to estimate the item parameters of
the LNMRT.² The R code for using the lavaan package to compute the MMLEs for the LNMRT
is provided in Appendix A. Molenaar et al. (2015) noted that the LNMRT, when used as
a component of a joint model, can be fitted using standard SEM software packages such as
EQS, Lisrel, Mplus, and Mx.


The Need to Assess the Fit of the LNMRT as a Stand-alone Model 


According to Standard 4.10 of the Standards for Educational and Psychological
Testing (American Educational Research Association, American Psychological Association, 
& National Council for Measurement in Education, 2014), evidence of model fit should 
be documented when model-based methods are used. Therefore, it is important to assess 
the fit of the LNMRT. Given that the LNMRT was first presented as a stand-alone model 
by van der Linden (2006) and that the LNMRT has been used as a stand-alone model in 
various applications by, for example, Boughton et al. (2017), Marianti, Fox, Avetisyan, 
Veldkamp, and Tijmstra (2014), Qian et al. (2016), and Sinharay (2018), there is a need for
more research on assessing the fit of the LNMRT as a stand-alone model. Also, a necessary
condition of the joint model (consisting of the LNMRT and an IRT model) fitting both the 
response times and response accuracies is that the LNMRT fits the response times. Thus, if 
a simple test of fit of the LNMRT as a stand-alone model shows misfit, then one may not 
need to fit the joint model and instead can proceed with another RTM. 


Given the two results that

• the effect of violation of the normality assumption is often negligible (e.g., Scheffé,
1959, p. 337),
• all models are wrong but some are useful (Box & Draper, 1987, p. 54),

one wonders whether the LNMRT is always useful, in terms of yielding accurate/valid
inferences, because of its underlying normality assumption, or whether some types of misfit
of the LNMRT have practical consequences and hence threaten the validity of the inferences
from the LNMRT.

¹The vector $\boldsymbol{\beta}$ is like the mean vector in the factor-analysis model.
²In limited simulations (the results of which are not reported here), the MMLEs of the item parameters
produced by lavaan were found to be very accurate.

A substantial number of applications of the LNMRT involve the detection of test fraud. 
For example, Boughton et al. (2017), Fox and Marianti (2017), Marianti et al. (2014), 
Qian et al. (2016), Sinharay (2018), Sinharay and Johnson (2019), and van der Linden and 
Guo (2008) used the LNMRT to detect various types of test fraud. Specifically, person-fit 
analysis, which is one of the six types of statistical methods that are used in practice 
to detect test fraud according to Wollack and Schoenig (2018), was performed using the 
LNMRT in Marianti et al. (2014), Fox and Marianti (2017), and Sinharay (2018). Let
us study the behavior of the Bayesian person-fit test statistic $l^t$ of Marianti et al. (2014)
under the misfit of the LNMRT. Klein Entink et al. (2009) showed that the LNMRT does
not adequately fit data simulated from the Box-Cox normal model. A data set of response
times of 5,000 examinees to 80 items was simulated from the Box-Cox normal model.³
Then the LNMRT was fitted to the data set and the $l^t$ statistic (Marianti et al., 2014)
was computed for all examinees. At the 5% level of significance, the $l^t$ statistic⁴ showed a
statistically significant misfit for about 8% of the examinees, whereas the misfit percentage would
be 5% if the LNMRT were fitted to data from the LNMRT or the Box-Cox normal model
were fitted to the data set. Thus, about 3% more examinees (or 150 examinees in the
sample of 5,000) would be erroneously flagged as possible cheaters when the poor-fitting
LNMRT is used instead of the better-fitting Box-Cox normal model. If data with an even
larger extent of misfit of the LNMRT are simulated (by setting $\nu$ larger than 0.3), then
the percent of examinees erroneously marked as possible cheaters would be even larger.
Given the serious consequences of false alarms in the context of detection of test fraud (e.g.,
Skorupski & Wainer, 2017, p. 347), this data example shows that the LNMRT may be
less useful than desired when it does not fit the data and that its misfit may have practical
consequences in some contexts. The example thus provides one more reason for assessing the fit of
the LNMRT in all applications of the model to real data.

³The $\nu$ parameter, which indicates the skewness of the corresponding distribution, for the items was set
to a mix of values 0.1, 0.2, and 0.3; Klein Entink et al. (2009) found the estimates of $\nu$ to be close to these
values for some items for a real data set.
⁴Whose Type I error rate has been found to be close to the nominal level by Marianti et al. (2014) and
Sinharay (2018).


Existing Approaches for Assessment of Fit for the LNMRT 

Schnipke and Scrams (1999) suggested the use of graphical plots and the root mean 
squared error between the observed and predicted cumulative distribution function of 
response times to assess the fit of the LNMRT. Several model-fit statistics have also 
been suggested in the context of applications of the LNMRT as a component of a joint 
model. These include several Lagrange Multiplier statistics, including one to assess item
fit (Glas & van der Linden, 2010), the Lagrange Multiplier statistic for assessing conditional
independence of the responses and response times (van der Linden & Glas, 2010), and the
item-fit statistics of Ranger and Kuhn (2014).⁵ Some of these methods can be adapted to
applications of the LNMRT as a stand-alone model. The few model-fit tools that have been 
suggested for the LNMRT as a stand-alone model include the Lagrange Multiplier test for 
assessing conditional independence of the response times (van der Linden & Glas, 2010) 
and the Bayesian residuals based on the posterior predictive distribution of response times 
(van der Linden & Guo, 2008). 

The item-fit statistic of Glas and van der Linden (2010) is designed to test the null
hypothesis $H_0$ that the response times for item $j$ follow the LNMRT given by Equation 1
versus the alternative hypothesis $H_1$ that the response times follow the distribution

$$y_{ij} \mid \tau_i \sim \mathcal{N}\left(\beta_j - \tau_i + \delta\, w(\mathbf{y}_i^{(j)}), \frac{1}{\alpha_j^2}\right), \qquad (4)$$

where $\delta$ is an unknown parameter that can be estimated from the data, $\mathbf{y}_i^{(j)}$ is the
collection of log-response times of the $i$-th individual on all items except item $j$, and

$$w(\mathbf{y}_i^{(j)}) = \begin{cases} 1 & \text{if the sum of the elements of } \mathbf{y}_i^{(j)} \le r, \\ 0 & \text{if the sum of the elements of } \mathbf{y}_i^{(j)} > r, \end{cases}$$

⁵Posterior-predictive person-fit tests have been suggested by Marianti et al. (2014) and Fox and Marianti
(2017), but this paper does not consider person fit.

where $r$ is an appropriate cut point that divides the examinees in two groups of roughly
equal sizes. The alternative hypothesis essentially states that the mean of the response
time is larger (compared to what is expected under the LNMRT) for the slow examinees and
smaller for the fast examinees or vice versa. Then the item-fit statistic of Glas and van
der Linden (2010), henceforth denoted as the $LM_j$ statistic, is obtained as the Lagrange
Multiplier statistic for testing the null hypothesis that $\delta$ is equal to 0.⁶ For large sample
sizes, the distribution of the $LM_j$ statistic can be approximated by a $\chi^2$ distribution with
one degree of freedom (df) when the LNMRT fits the data.

To compute the item-fit statistic of Ranger and Kuhn (2014) for item j, one first fits 
the RTM to the available data and then divides the examinees into G groups based on their 
response times on an item. Then, one counts the number of examinees who belong to these
groups; this results in a total of $G$ counts. Let us denote the collection of these counts for
item $j$ as $o_{j1}, o_{j2}, \ldots, o_{jG}$. One then computes the $e_{jg}$’s, which are the expected values of the
$o_{jg}$’s, under the assumption that the joint model fits the data. One finally computes the
statistic

$$T_j = \sum_{g=1}^{G} (o_{jg} - e_{jg})^2$$

that quantifies the extent of model misfit for item $j$. Ranger and Kuhn (2014) proved that the
distribution of $T_j$ is a multiple of a $\chi^2$ distribution.

In the Bayesian approach of van der Linden and Guo (2008), one determines if the
response time of an examinee-item combination is substantially different from what is
expected under the model. van der Linden and Guo (2008) showed that in applications of
the LNMRT as a stand-alone model, the posterior distribution of the predicted value of $y_{ij}$,
conditional on $\mathbf{y}_i^{(j)}$, is normal. Then, the standardized residual is computed as

$$e_{ij} = \frac{y_{ij} - E(y_{ij} \mid \mathbf{y}_i^{(j)})}{\sqrt{\mathrm{Var}(y_{ij} \mid \mathbf{y}_i^{(j)})}}. \qquad (5)$$

If the absolute value of $e_{ij}$ is larger than an appropriate quantile of the standard normal
distribution, the response time for the examinee on item $j$ is concluded to be aberrant. One
can compute the $e_{ij}$’s for an item over all examinees and then conclude misfit for the item
if the number of statistically significant standardized residuals for the item is larger than
an appropriate cutoff (as in, for example, Boughton et al., 2017, p. 184). In our simulations
(to be discussed later), this approach to assess item fit was found to have smaller power
than the other item-fit statistics, so this approach is not considered henceforth.

⁶Glas and van der Linden (2010) described the test using a vector-valued $\delta$, but used a scalar $\delta$ in their
data examples; we describe the test using a scalar $\delta$ for simplicity.

van der Linden and Glas (2010) stated that under violation of local independence for
the item pair $\{j, k\}$, the response times on these items for examinee $i$ will follow a bivariate
distribution given by

$$f(y_{ij}, y_{ik}) = \frac{\alpha_j \alpha_k}{2\pi\sqrt{1 - \rho_{jk}^2}} \exp\left\{ \frac{-1}{2(1 - \rho_{jk}^2)} \left( w_{ij}^2 + w_{ik}^2 - 2\rho_{jk} w_{ij} w_{ik} \right) \right\}, \qquad (6)$$

where $w_{ij} = \alpha_j (y_{ij} - \beta_j + \tau_i)$. They defined a Lagrange Multiplier statistic to test for local
independence of response times, which implies that $\rho_{jk}$ is equal to 0, as


(do, Pigiz)? een 
Y; (83, + da, — 1 — Coetegta? 


m mn 


LM. = 


(7) 


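where $\hat{w}_{ij} = \alpha_j (y_{ij} - \beta_j + \hat{\tau}_i)$ and $\hat{\tau}_i$ is the MLE of $\tau_i$ and is given by

$$\hat{\tau}_i = \frac{\sum_j \alpha_j^2 (\beta_j - y_{ij})}{\sum_j \alpha_j^2}. \qquad (8)$$

The statistic $LM_{jk}$ follows the $\chi^2$ distribution with one degree of freedom when the local
independence assumption holds.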
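
The MLE of the speed parameter for known item parameters is a precision-weighted average of the differences $\beta_j - y_{ij}$, which follows from setting the derivative of the log-likelihood under Equation 1 to zero. A minimal Python sketch is given below; the function name `speed_mle` and the numerical values are hypothetical and serve only as an illustration.

```python
def speed_mle(y_i, alpha, beta):
    """MLE of tau_i given item parameters: sum_j alpha_j^2 (beta_j - y_ij) / sum_j alpha_j^2."""
    weights = [a ** 2 for a in alpha]  # precision of y_ij is alpha_j^2
    numerator = sum(w * (b - y) for w, b, y in zip(weights, beta, y_i))
    return numerator / sum(weights)

# Hypothetical examinee who is faster than expected on both items.
tau_hat = speed_mle(y_i=[3.5, 4.0], alpha=[1.0, 2.0], beta=[4.0, 5.0])
```

In this example the second item, which has the larger discrimination parameter, receives four times the weight of the first item.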

Even though there exist several tools for testing the fit of the LNMRT, there seems
to be a need for further research on assessing model fit in applications of the LNMRT as
a stand-alone model. For example, there exist no comparison studies of the fit statistics
for the model. In addition, while there exist several $\chi^2$-type statistics for assessing item
fit for IRT models (e.g., Orlando & Thissen, 2000), there exist no $\chi^2$-type statistics for
assessing item fit for the LNMRT. This paper intends to fill this void by suggesting the
use of several statistics to assess item fit for the LNMRT; these statistics have been used
to assess the fit of other statistical models, but not to assess the fit of the LNMRT or any
other response-time model.


Tests of Normality 


Adefisoye, Golam Kibria, and George (2016) pointed to the existence of more than 40
tests of normality. These tests can be classified into the following three categories:

• tests based on the empirical distribution function: Examples of such tests are the
Kolmogorov-Smirnov test (e.g., Lilliefors, 1967), the Anderson-Darling test (Anderson & Darling,
1954), and the Lilliefors test (Lilliefors, 1967);

• tests based on moments: Examples of such tests are the Jarque-Bera test (Jarque & Bera,
1980) and those based on skewness and kurtosis (e.g., Mardia, 1970);

• tests based on correlation: Examples of such tests are the D’Agostino test (D’Agostino,
1971), the Shapiro-Wilk test (Shapiro & Wilk, 1965), and the Weisberg-Bingham test
(Weisberg & Bingham, 1975).

In addition, as Thode (2002) noted, $\chi^2$-type goodness-of-fit tests such as Pearson’s $\chi^2$
test (Pearson, 1900) can also be used to test for normality.

There also exist several comparison studies of normality tests including Adefisoye et al.
(2016), Gan and Koehler (1990), Razali and Wah (2011), Shapiro, Wilk, and Chen (1968),
and Yazici and Yolacan (2007). Using data simulated from 45 different distributions and
nine statistics, Shapiro et al. (1968) showed that the Shapiro-Wilk test statistic provided a
generally superior measure of non-normality. Seber (1984, p. 147-148) stated that among the
tests of normality, the Shapiro-Wilk, Anderson-Darling, and D’Agostino test statistics are
the most useful. Gan and Koehler (1990) compared the power of several normality tests and
found the Shapiro-Wilk test to be the best overall test for assessing normality. Yazici and
Yolacan (2007) found three tests including the Jarque-Bera test to be the most powerful in
a comparison of 12 tests of normality, but also found the Shapiro-Wilk test to be a superior
omnibus indicator of normality. Razali and Wah (2011) compared the power of four
tests (the Shapiro-Wilk test, Anderson-Darling test, Lilliefors test, and Kolmogorov-Smirnov
test) and found the Shapiro-Wilk test to be the most powerful. Adefisoye et al. (2016)
found that the test based on kurtosis is the most powerful for observations from a symmetric
distribution and the Shapiro-Wilk test is the most powerful for observations from an
asymmetric distribution. The Nikulin-Rao-Robson test (Nikulin, 1973; Rao & Robson,
1974) statistic has also been found to be more powerful than tests of normality in detecting
departures from normality (e.g., Voinov, Pya, & Alloyarova, 2009) and possesses several
optimality properties (e.g., Singh, 1987; Voinov, Nikulin, & Balakrishnan, 2013, p. 37). In
the simulation study later in this paper, the Shapiro-Wilk test, the Anderson-Darling test,
and the Nikulin-Rao-Robson test are used to test for item fit. Brief descriptions of these three
tests are provided below.


Shapiro-Wilk Test


In a test for normality based on observations $X_1, X_2, \ldots, X_I$, the Shapiro-Wilk test
statistic (Shapiro & Wilk, 1965) takes the form

$$W = \frac{\left(\sum_{i=1}^{I} a_i X_{(i)}\right)^2}{\sum_{i=1}^{I} (X_i - \bar{X})^2}, \qquad (9)$$

where $X_{(i)}$, $i = 1, 2, \ldots, I$, are a rearrangement of the $X_i$’s in increasing order and the
vector of weights, $\mathbf{a} = (a_1, a_2, \ldots, a_I)'$, can be computed as

$$\mathbf{a}' = \frac{\mathbf{m}'\mathbf{V}^{-1}}{\sqrt{\mathbf{m}'\mathbf{V}^{-1}\mathbf{V}^{-1}\mathbf{m}}},$$

where $\mathbf{m}$ and $\mathbf{V}$ respectively denote the mean vector and the covariance matrix of the
standard normal order statistics. Tables of the critical values of the asymptotic null
distribution of $W$ can be found in, for example, Shapiro and Wilk (1965), and the
statistic and its p-values can be computed using popular statistical software packages.
The p-values for the $W$ statistic were computed in this paper by applying the R function
shapiro.test (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/shapiro.test.html)
on the logarithm of the response times for each item. A p-value smaller than the nominal
level indicates an item misfit in the form of a departure from normality of the logarithm of
the response times. Schnipke and Scrams (1999) used normal probability plots to assess
fit of the LNMRT; however, the use of a normal probability plot involves a subjective
judgment on misfit whereas the Shapiro-Wilk test statistic quantifies the misfit in a test
statistic and a p-value, so the use of the Shapiro-Wilk test is more convenient, especially
when these tests have to be done on a large scale.


Anderson-Darling Test 


The Anderson-Darling test (Anderson & Darling, 1954) based on observations $X_1, X_2,
\ldots, X_I$ involves the use of the statistic

$$A^2 = -I - \frac{1}{I} \sum_{i=1}^{I} (2i - 1)\left[\log \Phi(Y_i) + \log\left(1 - \Phi(Y_{I+1-i})\right)\right],$$

where $Y_i = (X_{(i)} - \bar{X})/s$, $\bar{X}$ is the mean of the $X_i$’s, $s = \sqrt{\frac{1}{I-1}\sum_i (X_i - \bar{X})^2}$, and $\Phi$ is the
cumulative distribution function of the standard normal distribution. Tables of
the critical values of the asymptotic null distribution of $A^2$ can be found in, for example,
D’Agostino (1986), and the statistic and the p-values for the statistic can be computed
using popular statistical software packages. The p-values for $A^2$ were computed in this
paper using the R function ad.test included in the R package nortest (Gross & Ligges, 2015)
on the logarithm of the response times for each item. A p-value smaller than the nominal
level indicates an item misfit in the form of a departure from normality of the logarithm of
the response times.
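
The $A^2$ statistic can be sketched in Python as follows. This is a minimal illustration and not the nortest implementation used in this paper; it computes only the statistic (no p-value), the function names are hypothetical, and the use of the divisor $I - 1$ in the standard deviation is an assumption.

```python
import math
from statistics import NormalDist, mean, stdev

def log_normal_cdf(z):
    """log Phi(z), computed through erfc so that far-tail values do not round to log(0)."""
    return math.log(0.5 * math.erfc(-z / math.sqrt(2.0)))

def anderson_darling(x):
    """A^2 statistic for testing normality of x with estimated mean and SD."""
    n = len(x)
    xbar, s = mean(x), stdev(x)            # stdev uses the divisor n - 1 (assumption)
    y = sorted((v - xbar) / s for v in x)  # standardized order statistics
    total = 0.0
    for i in range(1, n + 1):
        # log(1 - Phi(y)) is evaluated as log Phi(-y).
        total += (2 * i - 1) * (log_normal_cdf(y[i - 1]) + log_normal_cdf(-y[n - i]))
    return -n - total / n

# Illustrative data: normal quantiles (no misfit) and their exponentials (skewed).
normal_scores = [NormalDist().inv_cdf((i - 0.5) / 200) for i in range(1, 201)]
a2_normal = anderson_darling(normal_scores)
a2_skewed = anderson_darling([math.exp(v) for v in normal_scores])
```

In an application to the LNMRT, the input `x` would be the log response times of all examinees on one item; a larger value of the statistic indicates a stronger departure from normality.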


Nikulin-Rao-Robson $\chi^2$ Test

To apply $\chi^2$-type goodness-of-fit tests to assess the normality for a sample of
observations $X_1, X_2, \ldots, X_I$, the observations are divided into $G$ groups based on the values
of the $X_i$’s so that the expected number of observations under the normal model is equal
over the groups; such groups can be created by using group boundaries that are of the form
$\bar{X} + qs$, where $\bar{X} = \frac{1}{I}\sum_i X_i$ and $q$ is a quantile of the standard normal distribution; for
example, if 10 groups are used (that is, $G = 10$), then the nine deciles of the standard normal
distribution are used as the $q$’s so that the group boundaries are $\bar{X} - 1.28s$, $\bar{X} - 0.84s$,
$\ldots$, $\bar{X} + 1.28s$.⁷ Then, $O_g$, the observed number of individuals belonging to Group $g$, is
computed. The expected number of examinees in each group is $\frac{I}{G}$ because of the way the
groups are constructed. The Pearson’s $\chi^2$ statistic represents the extent of departure of the
expected and observed number of examinees falling in each group and is defined as

$$X^2 = \sum_{g=1}^{G} \frac{\left(O_g - \frac{I}{G}\right)^2}{\frac{I}{G}} = \frac{G}{I}\sum_{g=1}^{G} O_g^2 - I.$$

The Nikulin-Rao-Robson $\chi^2$ statistic (Nikulin, 1973; Rao & Robson, 1974), which is also
called the Rao-Robson $\chi^2$ statistic, involves a correction term to the Pearson’s $\chi^2$ statistic
and is defined as

$$\chi^2_N = X^2 + \frac{1}{I}\left[\sum_g \epsilon_g O_g\right]^2 + \frac{1}{I}\left[\sum_g \phi_g O_g\right]^2, \qquad (10)$$

where the $\epsilon_g$’s and $\phi_g$’s are weights that depend on the normal density function and its
derivative; their expressions can be found in Nikulin (1973). The $\chi^2_N$ statistic has an
asymptotic $\chi^2_{G-1}$ null distribution and a large value of $\chi^2_N$ indicates a misfit of the normal
model to the data.
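
The equiprobable grouping and the Pearson’s $\chi^2$ statistic described above can be sketched in Python as follows. The sketch does not include the Nikulin-Rao-Robson correction term, whose weights are given in Nikulin (1973); the function name `pearson_equiprobable` is hypothetical, and the use of the divisor $I - 1$ in the standard deviation is an assumption.

```python
import bisect
import math
from statistics import NormalDist, mean, stdev

def pearson_equiprobable(x, n_groups):
    """Pearson's X^2 for normality with groups that are equiprobable under the fitted normal."""
    n = len(x)
    xbar, s = mean(x), stdev(x)  # stdev uses the divisor n - 1 (assumption)
    std_normal = NormalDist()
    # Interior boundaries xbar + q * s, where q is the (g / G)-th standard normal quantile;
    # the outer boundaries are -infinity and +infinity.
    cuts = [xbar + std_normal.inv_cdf(g / n_groups) * s for g in range(1, n_groups)]
    counts = [0] * n_groups
    for v in x:
        counts[bisect.bisect_left(cuts, v)] += 1  # number of boundaries below v
    expected = n / n_groups                        # I / G in every group
    stat = sum((o - expected) ** 2 for o in counts) / expected
    return stat, counts

# Illustrative data: normal quantiles (no misfit) and their exponentials (skewed).
normal_scores = [NormalDist().inv_cdf((i - 0.5) / 200) for i in range(1, 201)]
x2_normal, counts_normal = pearson_equiprobable(normal_scores, n_groups=10)
x2_skewed, counts_skewed = pearson_equiprobable([math.exp(v) for v in normal_scores], n_groups=10)
```

Because every group has the same expected count, the statistic can equivalently be computed from the sum of the squared observed counts.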


Methods 


In this section, several existing statistics for testing normality and the fit of factor
analysis models are repurposed for testing item fit and the assumption of local independence
of the LNMRT.

⁷$-\infty$ and $\infty$ are used as the lower boundary of the first group and the upper boundary of the 10th group,
respectively.

Item Fit Analysis

Equation 1 implies that for a randomly chosen examinee, the marginal distribution of
$y_{ij}$ is given by

$$y_{ij} \sim \mathcal{N}\left(\beta_j, \sigma^2 + \frac{1}{\alpha_j^2}\right). \qquad (11)$$

In addition, for an item, the $y_{ij}$’s of the different examinees are independent. Therefore, tests
of normality (e.g., Thode, 2002) can be used to assess item fit for the LNMRT. We used
several tests of normality including the Shapiro-Wilk test, the tests for skewness and
kurtosis, the Anderson-Darling test, the Jarque-Bera test, and the Lilliefors test to assess
item fit. Results for the Shapiro-Wilk test and the Anderson-Darling test are reported later.
R code for computing the p-values for these tests is provided in Appendix A. The other
three tests had low power for the types of misfit considered later in our simulations, so the
results of these statistics are not discussed henceforth.

We also used $\chi^2_N$, the Nikulin-Rao-Robson $\chi^2$ statistic (Nikulin, 1973; Rao & Robson,
1974), to test for item fit for the LNMRT. To compute the statistic, the number of groups,
$G$, was set equal to 20, 30, and 50, respectively, for sample sizes of 500, 1,000, and 5,000
in our simulations; these values are in agreement with the recommendation of using $2I^{2/5}$
groups (e.g., Moore, 1986, p. 70) with $\chi^2$-type tests; the use of other numbers of groups
led to slightly smaller power of $\chi^2_N$ in a preliminary investigation. Moore (1986, p. 92)
commented that “Among the chi-square statistics proposed and studied to date, the
Rao-Robson statistic appears to have generally superior power and is therefore the statistic
of choice for protection against general alternatives.” Also, the $\chi^2_N$ statistic has been
found to outperform the Pearson’s $\chi^2$ statistic in power comparisons (e.g., Rao & Robson,
1974; Voinov et al., 2009) and was more powerful than the Pearson’s $\chi^2$ statistic in our
study; therefore, results for the Pearson’s $\chi^2$ statistic are not provided. Also, given the
optimality properties of the $\chi^2_N$ statistic (e.g., Singh, 1987), the statistic is expected to
be more powerful than Ranger-Kuhn’s $T_j$ statistic which, like $\chi^2_N$, is computed from the
differences between observed and expected frequencies, but, unlike $\chi^2_N$, does not involve a
division by the expected frequencies. R code for computing the $\chi^2_N$ statistic for an item
is provided in Appendix A. There have not been many comparison studies between
normality tests such as the Shapiro-Wilk test and $\chi^2$ tests for assessing normality of data.
In addition, the advantages of using $\chi^2$ tests to detect item misfit of the LNMRT are that
(a) they are in the same spirit as the $\chi^2$-type statistics such as the Orlando-Thissen $\chi^2$
statistics (Orlando & Thissen, 2000) for assessing item fit of IRT models, and (b) they may
have more power than normality tests to detect some types of misfit.


Test for Local Independence 

Several statistics for testing the local independence of the item scores, such as the Q3 
statistic (Yen, 1984), are based on the correlation between the scores on a pair of items. 
Borrowing the idea, a test statistic based on the correlation between the logarithm of 
observed response-times on a pair of items can be used to test the local independence of the 
response times. 


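Equation 1 and the population distribution $g(\tau_i) = \mathcal{N}(0, \sigma^2)$ imply that

$$E(y_{ij}) = E(E(y_{ij} \mid \tau_i)) = E(\beta_j - \tau_i) = \beta_j,$$

$$\mathrm{Cov}(y_{ij}, y_{ik}) = \mathrm{Cov}(E(y_{ij} \mid \tau_i), E(y_{ik} \mid \tau_i)) = \mathrm{Cov}(\beta_j - \tau_i, \beta_k - \tau_i) = \sigma^2, \qquad (12)$$

$$\mathrm{Var}(y_{ij}) = E(\mathrm{Var}(y_{ij} \mid \tau_i)) + \mathrm{Var}(E(y_{ij} \mid \tau_i)) = \frac{1}{\alpha_j^2} + \mathrm{Var}(\beta_j - \tau_i) = \frac{1}{\alpha_j^2} + \sigma^2. \qquad (13)$$

Let $r_{jk}$ denote the observed correlation coefficient between the vectors of the examinees’
log-response times on items $j$ and $k$, where $j \neq k$. Given Equations 12 and 13, the
corresponding population correlation coefficient under the LNMRT is

$$\rho_{jk} = \frac{\sigma^2}{\sqrt{\left(\sigma^2 + \frac{1}{\alpha_j^2}\right)\left(\sigma^2 + \frac{1}{\alpha_k^2}\right)}}. \qquad (14)$$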


The population correlation $\rho_{jk}$ involves the unknown parameters $\sigma^2$ and $\alpha_j$’s. Let $\hat{\rho}_{jk}$ denote the
estimate of $\rho_{jk}$; one computes $\hat{\rho}_{jk}$ by replacing the parameters $\sigma^2$ and $\alpha_j$’s in Equation 14 by
their estimates from a sample. For factor-analysis models, researchers such as Ogasawara
(2001) and Maydeu-Olivares (2017) provided approaches to compute the estimated standard
deviation $s(r_{jk} - \hat{\rho}_{jk})$ of the residual $r_{jk} - \hat{\rho}_{jk}$, and Maydeu-Olivares (2017) showed that
asymptotically,

$$Z_{jk} = \frac{r_{jk} - \hat{\rho}_{jk}}{s(r_{jk} - \hat{\rho}_{jk})} \sim \mathcal{N}(0, 1)$$

under the null hypothesis of the model fitting the data. Given the earlier discussion on how
the LNMRT can be expressed as a factor-analysis model, the $Z_{jk}$ statistic can be used to
assess the local independence of the response times for items $j$ and $k$ in applications of the
LNMRT. A large absolute value of $Z_{jk}$ would indicate the violation of local independence
for the corresponding item pair. The values of the $Z_{jk}$ statistic were computed using the
function lavResiduals in the R package lavaan (Rosseel, 2012). R code for computing the
$Z_{jk}$ statistic is provided in Appendix A. Ogasawara (2001) and Maydeu-Olivares (2017)
discussed other standardized residuals similar to $Z_{jk}$, but, in limited simulations, they were
found to perform similarly to $Z_{jk}$; therefore, results for only $Z_{jk}$ among the residuals are
provided in this paper. Molenaar et al. (2015) used modification indices to test for local
independence of response-time models, but the $Z_{jk}$ statistic provides a more rigorous and
principled approach to test for the local independence for the LNMRT.⁸ We also tried a
version of $Z_{jk}$ in which the denominator was the standard deviation of $r_{jk}$, rather than that
of $r_{jk} - \hat{\rho}_{jk}$; while this version involves less computation compared to $Z_{jk}$, it had smaller
power compared to $Z_{jk}$ and is not considered henceforth.
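
The numerator of the residual-based statistic, that is, the difference between the observed correlation and the correlation implied by Equation 14, can be sketched in Python as follows. This is an illustration only: the function names and numerical values are hypothetical, the parameter values are plugged in directly rather than estimated, and the standard deviation in the denominator (which this paper obtains from lavResiduals) is not computed.

```python
import math

def implied_correlation(alpha_j, alpha_k, sigma2):
    """Correlation of log response times on items j and k implied by the LNMRT."""
    var_j = sigma2 + 1.0 / alpha_j ** 2
    var_k = sigma2 + 1.0 / alpha_k ** 2
    return sigma2 / math.sqrt(var_j * var_k)

def observed_correlation(u, v):
    """Pearson correlation between two equally long lists of log response times."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# Hypothetical parameter values: both residual variances equal sigma2, so rho = 0.5.
rho = implied_correlation(alpha_j=2.0, alpha_k=2.0, sigma2=0.25)

# Hypothetical log response times of five examinees on two items.
y_j = [3.9, 4.2, 3.7, 4.5, 4.0]
y_k = [4.4, 4.9, 4.1, 5.0, 4.6]
residual = observed_correlation(y_j, y_k) - rho
```

A residual far from zero relative to its standard deviation would point to a violation of local independence for the item pair.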


Simulation Study 


Three sets of simulations were performed to study the properties of and compare
the performances of the following model-fit statistics: (a) the Shapiro-Wilk statistic,
(b) the Anderson-Darling statistic, (c) the Nikulin-Rao-Robson $\chi^2$ statistic, (d) Ranger-Kuhn’s
$T_j$ statistic (Ranger & Kuhn, 2014), (e) the Lagrange Multiplier item-fit statistic $LM_j$ (Glas
& van der Linden, 2010), (f) the Lagrange Multiplier local-independence statistic $LM_{jk}$
(van der Linden & Glas, 2010), and (g) the $Z_{jk}$ statistic (Maydeu-Olivares, 2017). The first set
of simulations, which involved analysis of data generated under no model misfit, was
intended to examine the Type I error rates of the aforementioned statistics. The second
set of simulations, which involved analysis of data generated under some item misfit, was
intended to examine the power of the item-fit statistics. The third set of simulations, which
involved analysis of data generated under the violation of local independence, was intended
to examine the power of the statistics for assessing local independence. Test lengths of 20,
40, and 60 items and sample sizes of 500, 1,000, and 5,000 were considered in each of the
three sets of simulations.

⁸While the modification index suggests whether a fixed parameter in a factor-analysis model can be freed
(e.g., Satorra, 1989), it is not clear how the index can be used to test the hypothesis of local independence
for an LNMRT.


Simulation of Data Under No Model Misfit 


Data under no misfit were simulated under the LNMRT given by Equation 1. The true
values of the $\alpha_j$'s and $\beta_j$'s were simulated from a $N(1.87, 0.15^2)$ and a $N(4, 0.45^2)$ distribution,
respectively. The true values of the $\tau_i$'s were simulated from a $N(0, 0.3^2)$ distribution. These
generating distributions were intended to make the summary of the simulated data resemble
that of the real data described in van der Linden (2006). For example, with these generating
distributions, the mean response times of the items were roughly between 25 and 150
seconds and the mean response times of the persons were between 20 and 170 seconds;
these roughly match the corresponding quantities in Figures 1 and 2 of van der Linden
(2006). A total of 100 data sets were simulated for each of the 9 combinations of test length
and sample size; a new set of true values of the parameters was used to simulate each of
these 100 data sets. For each simulated data set, the MLEs of the item parameters were
computed using the lavaan package (Rosseel, 2012), and these MLEs were then used to
compute the item-fit statistics and the statistics for assessing local independence. Statistical
significance for each statistic was determined by comparing the value of the statistic to the
corresponding theoretical percentile.
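The data-generation step described above can be sketched in R as follows. This is only an illustrative sketch, not the code used in the study: it assumes the usual parameterization of the lognormal model, in which the log-response time of person i on item j is normal with mean $\beta_j - \tau_i$ and standard deviation $1/\alpha_j$, and the function name simulate_lnmrt is hypothetical.

```r
# Sketch: simulate log-response times under the lognormal model,
# ASSUMING log(T_ij) ~ N(beta_j - tau_i, 1 / alpha_j^2).
simulate_lnmrt <- function(n_persons, n_items) {
  alpha <- rnorm(n_items, mean = 1.87, sd = 0.15)  # item discrimination parameters
  beta  <- rnorm(n_items, mean = 4, sd = 0.45)     # item time-intensity parameters
  tau   <- rnorm(n_persons, mean = 0, sd = 0.3)    # person speed parameters
  # Mean matrix: entry (i, j) equals beta_j - tau_i
  mu <- outer(-tau, beta, "+")
  # Standard-deviation matrix: every entry in column j equals 1 / alpha_j
  sigma <- matrix(1 / alpha, nrow = n_persons, ncol = n_items, byrow = TRUE)
  noise <- matrix(rnorm(n_persons * n_items), n_persons, n_items)
  list(log_times = mu + sigma * noise, alpha = alpha, beta = beta, tau = tau)
}

set.seed(1)
sim <- simulate_lnmrt(n_persons = 500, n_items = 20)
item_means <- colMeans(exp(sim$log_times))  # mean response time per item (seconds)
range(item_means)
```

The range of the item means gives a quick check of whether the generating distributions produce response times on the scale described in the text.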


Simulation of Data Under Some Item Misfit 


In this set of simulations, the generated data sets included a majority of examinees
whose response times followed the LNMRT given by Equation 1 (the response times of
these examinees were simulated in a manner similar to that for data simulated under no
model misfit) and a small fraction of aberrant/misfitting examinees whose response times
did not follow the LNMRT. The percent of aberrant examinees in a data set was assumed
to be 2, 5, or 10. Each generated data set included a majority of items with no misfit and a
few items with some misfit. The number of items with misfit was assumed to be 2, 4, and
6, respectively, for test lengths of 20, 40, and 60, which means that misfit was assumed to
be present for 10% of the items. The items with misfit and the aberrant examinees were
randomly chosen for each simulated data set.


Three types of item misfit were considered, arising from

• some examinees having preknowledge of the item
• the response times for the item being simulated from the Box-Cox normal model (Klein
Entink et al., 2009)
• the response times for the item including a positive shift for incorrect responses


To create the first type of misfit (item preknowledge), it was assumed that the
response times of the aberrant examinees followed the LNMRT given by Equation 1 for the
non-misfitting (or non-compromised) items, but were equal to 10, 20, or 30 seconds for the
misfitting (or compromised) items (which constitute 10% of all items on the test). To create
the second type of misfit (Box-Cox normal model), it was assumed that the response times
of the aberrant examinees followed the LNMRT given by Equation 1 for the non-misfitting
items, but were simulated from the Box-Cox normal model (Klein Entink et al., 2009)
with $\nu$-parameter 0.2, 0.5, or 0.8 for the misfitting items. To create the third type of
misfit (positive shift for incorrect responses), the response times of the aberrant examinees
for all the items were simulated from the LNMRT given by Equation 1, but a shift of 5,
10, or 15 seconds was added to their response times for the misfitting items only if their
responses on the items were incorrect;⁹ this type of misfit represents the scenario in which those
not knowing the answer to an item often spend more time on the item and eventually
answer the item incorrectly. Ranger and Kuhn (2014) simulated this type of misfit in their
simulation study.

⁹Item scores were generated for this case under the three-parameter logistic model.
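As an illustration of the third type of misfit, the following R sketch adds a fixed shift to the response times of aberrant examinees on the misfitting items, but only when the response is incorrect. It is a sketch under stated assumptions rather than the code used in the study: the placeholder response times, the three-parameter logistic item parameters (a, b, and guess), and all object names are hypothetical.

```r
# Sketch: positive shift for incorrect responses (third type of misfit).
# All 3PL parameter values and the placeholder times are HYPOTHETICAL.
set.seed(2)
n_persons <- 1000
n_items <- 20
times <- matrix(rlnorm(n_persons * n_items, meanlog = 4, sdlog = 0.5),
                n_persons, n_items)           # placeholder model-fitting times

# Item scores under the three-parameter logistic model
theta <- rnorm(n_persons)                      # abilities
a <- runif(n_items, 0.8, 1.6)                  # discriminations
b <- rnorm(n_items)                            # difficulties
guess <- 0.2                                   # common guessing parameter
eta <- sweep(outer(theta, b, "-"), 2, a, "*")  # a_j * (theta_i - b_j)
p_correct <- guess + (1 - guess) * plogis(eta)
scores <- matrix(rbinom(n_persons * n_items, 1, p_correct), n_persons, n_items)

aberrant <- sample(n_persons, round(0.05 * n_persons))   # 5% aberrant examinees
misfit_items <- sample(n_items, round(0.10 * n_items))   # 10% misfitting items
shift <- 10                                              # shift in seconds

# Add the shift only where an aberrant examinee answered a misfitting item incorrectly
incorrect <- scores[aberrant, misfit_items, drop = FALSE] == 0
shifted <- times
shifted[aberrant, misfit_items] <- times[aberrant, misfit_items] + shift * incorrect
```

The preknowledge type of misfit can be created in the same way by overwriting the selected cells with a constant time instead of adding a shift.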


For each simulation condition represented by a test length, a sample size, a percent
of aberrant examinees, and a specific magnitude of misfit (represented by the time,
$\nu$-parameter, and shift for the three types of misfit), the following steps were iterated 100
times:

1. Simulate a data set with mostly non-aberrant examinees and some aberrant examinees;
2. Compute the MLEs of the item parameters for the data set;
3. Compute the item-fit statistics of the misfitting items using the MLEs computed above.

The power of each item-fit statistic for each simulation condition was computed as the
percent of misfitting items that had a significant value of the statistic under that simulation
condition.
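The iteration above can be sketched in R for one condition, here item preknowledge with the Shapiro-Wilk test. The function name power_sw and its arguments are hypothetical, and the data-generating model assumes that log-response times are normal with mean $\beta_j - \tau_i$ and standard deviation $1/\alpha_j$. Because the Shapiro-Wilk test is applied directly to the log-response times of an item, this sketch needs no item-parameter estimates, so Step 2 is skipped.

```r
# Sketch: power of the Shapiro-Wilk test under item preknowledge.
# Function and argument names are HYPOTHETICAL.
power_sw <- function(n_rep = 100, n_persons = 500, n_items = 20,
                     pct_aberrant = 0.10, fast_time = 10, level = 0.05) {
  n_misfit <- round(0.10 * n_items)            # 10% of the items are misfitting
  n_aberrant <- round(pct_aberrant * n_persons)
  rejected <- logical(0)
  for (r in seq_len(n_rep)) {
    # Step 1: simulate model-fitting log-times, then inject preknowledge
    alpha <- rnorm(n_items, 1.87, 0.15)
    beta <- rnorm(n_items, 4, 0.45)
    tau <- rnorm(n_persons, 0, 0.3)
    noise <- matrix(rnorm(n_persons * n_items), n_persons, n_items)
    log_times <- outer(-tau, beta, "+") + noise %*% diag(1 / alpha)
    items <- sample(n_items, n_misfit)
    persons <- sample(n_persons, n_aberrant)
    log_times[persons, items] <- log(fast_time)  # fixed short time on compromised items
    # Step 3: Shapiro-Wilk p-value for each misfitting item
    p_values <- apply(log_times[, items, drop = FALSE], 2,
                      function(x) shapiro.test(x)$p.value)
    rejected <- c(rejected, p_values < level)
  }
  100 * mean(rejected)  # percent of misfitting items with a significant statistic
}

set.seed(3)
power_sw(n_rep = 20)
```

Note that shapiro.test in base R accepts sample sizes between 3 and 5,000, which covers the sample sizes considered here.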


Simulation of Data Under Violation of Local Independence 


In this set of simulations, the generated data sets included a majority of item-examinee
combinations for which the response times followed the LNMRT given by Equation 1 (these
response times were simulated in a manner similar to that for data
simulated under no model misfit) and some other item-examinee combinations for which
local dependence was simulated in one of two ways.

To simulate the first type of local dependence, we assumed that 10, 20, or 40 percent
of examinees suffer from speededness on one-fifth of the items at the end of the test and
respond in 10, 20, or 30 seconds to those items. To simulate the second type of local
dependence, we simulated response times for 10 item pairs using the bivariate distribution
given by Equation 6 for 10, 20, or 40 percent of examinees. The correlation $\rho_{jk}$ (which
quantifies the extent of local dependence) in Equation 6 was assumed to be 0.05, 0.1, or 0.2.

For each simulation condition represented by a test length, a sample size, a percent of
aberrant examinees, and a value of the response time under speededness or a value
of $\rho_{jk}$, the following steps were iterated 100 times:

1. Simulate a data set that involves violation of local independence;
2. Compute the MLEs of the item parameters for the data set;
3. Compute the $LM_{jk}$ and $Z_{LI}$ statistics for all the item pairs for which the local
independence assumption was violated.
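To see why speededness of this kind induces dependence between items, consider the following R sketch. It uses placeholder log-response times that are independent across items (a simplification, not the LNMRT), overwrites the times of 20 percent of the examinees on the last one-fifth of the items with 10 seconds, and compares the correlation between the last two items before and after. All object names are hypothetical.

```r
# Sketch: speededness on the last one-fifth of the items.
# The placeholder log-times are independent across items (a simplification).
set.seed(4)
n_persons <- 1000
n_items <- 20
log_times <- matrix(rnorm(n_persons * n_items, mean = 4, sd = 0.5),
                    n_persons, n_items)
r_before <- cor(log_times[, n_items - 1], log_times[, n_items])

speeded_items <- (n_items - n_items / 5 + 1):n_items          # items 17 to 20
speeded_persons <- sample(n_persons, round(0.20 * n_persons)) # 20% of examinees
log_times[speeded_persons, speeded_items] <- log(10)          # respond in 10 seconds

r_after <- cor(log_times[, n_items - 1], log_times[, n_items])
c(before = r_before, after = r_after)
```

The shared block of short response times produces a clearly positive correlation between the affected items, which is the kind of excess association that a local-independence statistic is meant to pick up.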


Results for Data Simulated Under No Model Misfit 


Except for the $LM_j$ and $LM_{jk}$ statistics, all the other statistics have satisfactory
Type I error rates (Appendix B includes a table showing these rates); the rates are close
to and often smaller than the nominal level in all simulation conditions.


Results for Data Simulated Under Some Item Misfit 
Figure 1: Density plots of logarithm of response times for four items.

Figure 1 shows the density plots of the logarithm of response times for four items from
the simulation cases involving 5,000 examinees and 80 items. The LNMRT fits the data for
Item 1 (circles on a dotted line), so the logarithms of the response times follow the
normal distribution for that item; the model does not fit the other items. Item 2 (triangles on a
dotted line) represents an item on which 10% of the examinees had preknowledge and answered
the item in 5 seconds. The response times for Item 3 (plus symbols on a dotted line) were
simulated from the Box-Cox normal model (Klein Entink et al., 2009). The response times
for Item 4 (multiplication symbols on a dotted line) include a shift of 5 seconds for the
wrong responses. The figure shows that the distribution of the response times for Item 2 is
bimodal. The figure also shows that, compared to the item with no misfit, the distribution
of the logarithm of response times for Item 3 is shifted slightly towards the left and that
for Item 4 is shifted slightly towards the right. In addition, the distributions for Items 2-4
have lower peaks than that of Item 1. The values of the Nikulin-Rao-Robson statistic
are approximately 51, 2805, 77, and 534 for the four items, respectively, the critical value at the 5% level being
66.3 (the 95th percentile of a $\chi^2$ distribution with 49 degrees of freedom).
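The critical value quoted above can be reproduced with base R. The sketch assumes that 50 groups were used, which is what 49 degrees of freedom imply if the degrees of freedom equal the number of groups minus one; the object names are hypothetical.

```r
# Sketch: critical value of the Nikulin-Rao-Robson statistic at the 5% level.
G <- 50                          # ASSUMED number of groups, so that df = G - 1 = 49
crit <- qchisq(0.95, df = G - 1)
round(crit, 1)                   # 66.3

nrr <- c(51, 2805, 77, 534)      # approximate values reported for the four items
nrr > crit                       # FALSE TRUE TRUE TRUE
```

Only the first item, for which the LNMRT fits, has a value below the critical value.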

Other factors remaining the same, the power of each statistic was very similar over
different test lengths. Figures 2 to 4, respectively, show the average power (averaging over
the different test lengths) of four of the item-fit statistics for detecting the three types of
item misfit for different values of sample size (I), percent of aberrant examinees, and the
extent of misfit (denoted by the time taken to answer the compromised items, the
$\nu$-parameter, or the shift for the three types of misfit). The values of power at the 5% level are reported in
these figures; the conclusions from power at the 1% level (not reported here) are very similar.
The three rows of the figures correspond to sample sizes 500, 1,000, and 5,000, respectively.
The three panels in each row show the average values of power of the statistics for various
extents of misfit. In each panel, the percent of aberrant examinees is shown along the
X-axis and the power of each statistic is shown along the Y-axis. The power values for the
Shapiro-Wilk (SW) statistic, Anderson-Darling (AD) statistic, Nikulin-Rao-Robson (NRR)
statistic, and Ranger-Kuhn's (RK) $T_j$ statistic are shown using hollow circles, hollow
triangles, plus signs, and multiplication signs, respectively, joined by a dotted line.

The power of the Lagrange multiplier item-fit statistic ($LM_j$) is much smaller than that
of the other statistics; hence, this statistic is not included in Figures 2 to 4. The low power
may be due to the fact that the misfit created in this paper is not the type of misfit that
the statistic is best suited to detect. In limited simulations, the $LM_j$ statistic was found to have
larger power when item misfit was created by simulating data using Equation 4 instead of
Equation 1.¹⁰

¹⁰However, from a study of the estimated residuals given by Equation 5 for our real data sets to be
discussed later, it was not clear that misfit created by using Equation 4 is prevalent in practice, so this type
of misfit is not considered in our simulation study.

Figure 2: Average power across test lengths to detect the first type of item misfit (Preknowledge)
of the item-fit statistics.

Figure 3: Average power across test lengths to detect the second type of item misfit (Box-Cox
normal distribution) of the item-fit statistics.

Figure 4: Average power across test lengths to detect the third type of item misfit (a shift
in time for incorrect answers) of the item-fit statistics.

Figures 2-4 show that:

• Power increases with an increase in sample size, which is a favorable result for the
item-fit statistics (e.g., Rao, 1973, p. 464).
• Power becomes larger as the percent of aberrant examinees increases.
• Power mostly becomes larger as the extent of misfit (denoted by time or the
$\nu$-parameter or the shift) increases.
• The power of all the statistics is very close to 1 in the bottom right panel (that is, for
large samples and large extent of item misfit) in each figure.
• No item-fit statistic uniformly has the largest power in all simulation cases, but the
Shapiro-Wilk statistic comes close to achieving this distinction. The statistic
consistently has the largest power for small sample sizes and small percentages of aberrant
examinees. The large power of the Shapiro-Wilk statistic agrees with its superior
performance in several comparisons of normality tests (reviewed earlier).
• The Nikulin-Rao-Robson $\chi^2$ statistic has the largest power in a few cases (for example,
in the bottom left and middle panels of Figure 2), but is less powerful than the
Shapiro-Wilk and Anderson-Darling statistics in general.
• The Ranger-Kuhn statistic ($T_j$) has the smallest power overall.


Results for Data Simulated Under Violation of Local Independence 

Figure 5 shows the average values of power (averaging over the test lengths) of $LM_{jk}$
and $Z_{LI}$ to detect violation of local independence of the two types in a similar manner as
in Figure 2, except that the three panels in each row of the figure correspond to the three
values of the response times under speededness (top row) or the three values (0.05, 0.1, and
0.2) of $\rho_{jk}$ (bottom row), and each panel shows results for all sample sizes, using dotted
lines (sample size of 5,000), dashed lines (1,000), and solid lines (500).

Figure 5: Power to Detect Violation of Local Independence.

Figure 5 shows that:

• The power of both statistics increases as the sample size or the percent of aberrant
examinees increases.
• For the sample size of 5,000, the power of $Z_{LI}$ is extremely high if the percent of
aberrant examinees is 20 or larger.
• The power increases with a decrease in time (of responding under speededness) or an
increase in $\rho_{jk}$.
• The $Z_{LI}$ statistic is much more powerful than $LM_{jk}$ in all simulation cases.
Real Data Examples

Example 1: A Mathematics Test 


Let us consider a real data set that consists of responses and response times of 1,079 
American test takers in grade 8 on 40 mathematics items. The data were analyzed by van 
Rijn and Ali (2017) and were collected as part of a larger study. Thirty-two items are 
multiple-choice and eight are numeric entry. The items focus on basic topics in number, 
measurement, geometry, data analysis and algebra, and are dichotomously scored. The 
items were assembled in four different forms using blocks of ten items, with different orders 
of the blocks to counterbalance order effects. The time limit was 90 minutes. 

The test was administered under low-stakes conditions, so we computed the response
time effort (RTE) measure of Wise and Kong (2005) for the data set to examine whether the
examinees suffered from a lack of motivation. The RTE for an examinee is the proportion
of items for which the response time of the examinee is above a cutoff. With a cutoff of
5 seconds, a value of 0.8 of the RTE means that an examinee took more than 5 seconds
on 80% of the items (so lower values of RTE mean less effort). With cutoffs of 5, 10, and
15 seconds, respectively, 0, 10, and 64 out of the 1,079 examinees had an RTE less than
.80. With a cutoff value of 10 seconds, the lowest RTE value found for the data set is .45,
meaning that the corresponding examinee spent 10 seconds or less on 55% of the items.
The number of examinees answering an item in less than 10 seconds ranges between 10
and 151 for the items. These numbers indicate that, overall, there is not enough evidence
that many examinees suffered from a lack of motivation. Figure 6 shows the total time
in minutes versus the raw score on the test; it shows almost no correlation (correlation
coefficient = -0.05) between the total time and raw score.
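The RTE computation described above amounts to a row-wise proportion. The following R sketch illustrates it on a small made-up matrix of response times (two examinees, five items); the function name rte and the data are hypothetical.

```r
# Sketch: response time effort (RTE) as the proportion of items on which
# an examinee's response time exceeds the cutoff.
rte <- function(times, cutoff) rowMeans(times > cutoff)

# Made-up response times in seconds (rows: examinees, columns: items)
times <- rbind(c(3, 12, 40, 8, 25),
               c(30, 45, 6, 50, 70))

rte(times, cutoff = 5)    # 0.8 1.0
rte(times, cutoff = 10)   # 0.6 0.8
```

Counting the examinees with an RTE below .80 at a given cutoff is then sum(rte(times, cutoff) < 0.8).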

Figure 6: Plot of total time versus raw score for the Mathematics data.

The MLEs of the $\alpha_j$'s were between 1.18 and 2.05 and those of the $\beta_j$'s were between
2.91 and 4.96. Values of both the Shapiro-Wilk test statistic and the
Nikulin-Rao-Robson $\chi^2$ statistic were computed for all the items in the data set. The
number of groups used was 20 for the latter test. At the 5% level, all the items were found to
have statistically significant values of both statistics. At the 1% level, 97.5% and 100% of
the items were found to have statistically significant values of the Shapiro-Wilk statistic
and the Nikulin-Rao-Robson statistic, respectively.

Figure 7 shows the normal probability plots of the log-response times for nine
items (randomly chosen) among the 40 items. In each panel, a diagonal line is also shown
for convenience; a curve close to the diagonal line would indicate a good fit of the LNMRT
to the data. The figure shows that while the fit is not too bad on the right side of the panels,
the curve drops well below the diagonal line towards the left of several panels. One cause
of this drop is quick responding by several examinees. This is demonstrated in Figure 8.
The left panel of the figure shows the standardized residuals $e_{ij}$'s (van der Linden & Guo,
2008) versus the estimated $\tau_i$'s for all the examinees for Item 8. Horizontal lines are drawn
at 2 and -2; residuals outside these lines are statistically significant at the 5% significance
level. The right panel shows the actual response times versus the estimated $\tau_i$'s for all the
examinees for the item; a logarithmic scale is used for the vertical axis because of the
presence of several large response times. Horizontal lines are drawn at 10 and 20 seconds.
The left panel shows many significant residuals (for about 6% of examinees); most of these
are negative, indicating that several examinees responded to the item much sooner than
what can be expected under the LNMRT. The right panel shows that a substantial number
of examinees answered the item within 5-15 seconds, whereas the average response time for
the item was 61 seconds; this justifies the way the first type of item misfit was created in
our simulation study discussed earlier.

Figure 7: Normal probability plots for nine items from the Mathematics data.


Figure 9 provides a deeper look at the relationship between the misfit and quick
responding. The left panel shows a plot of the average per-item time (in seconds) of the
examinees (X-axis) versus the values of a person-fit statistic using response times (Sinharay,
2018) whose larger values indicate more misfit¹¹ (Y-axis). A horizontal dashed line is
provided at the critical value of the person-fit statistic at the 5% significance level; values of the
statistic above this line are significant. The correlation between the two plotted quantities
is -0.25, which, together with the plot, indicates that more misfit is associated with quicker
responding. In particular, the three quickest examinees (who appear on the extreme left of
the left panel) all have significant values of the person-fit statistic.¹² The right panel shows
a plot of the average per-person time (in seconds) on the items (X-axis) for all examinees
in the sample versus the response times (in seconds) on the items (Y-axis) for one of the
examinees who appear on the extreme left of the left panel; a diagonal (dashed) line is
added to the plot. The panel shows that the examinee answered all the items except one
faster than the other examinees on average. Thus, Figure 9 shows that a part of the severe
extent of item misfit in the data can be attributed to some examinees who responded
quickly. But several points above the horizontal line and towards the right of the left panel
of Figure 9 show that some examinees took longer than average and yet had significant
values of the person-fit statistic, so quick responding is not the only source of the misfit of
the LNMRT to the data.

¹¹A misfit for a person indicates that the LNMRT does not adequately fit the response times of that
person.

¹²In addition, their raw scores are 8, 18, and 23, respectively, so it is not clear that they answered quickly
because they were very strong in mathematics.

Figure 8: The residuals and response times versus the estimated speed parameters for Item 8.

Figure 9: The relationship between misfit of the LNMRT and quick responding.

In addition, the $S - \chi^2$ item-fit statistic based on item scores and suggested by Orlando
and Thissen (2000) was computed for all the items; the statistic was significant for 7
items (or about 18% of the items) at the 5% significance level. The correlation between $S - \chi^2$
and the Nikulin-Rao-Robson $\chi^2$ statistic was 0.25, so there seems to be a small positive
association between item fit based on item scores and item fit based on response times.

The value of the $Z_{LI}$ statistic was significant at the 5% level for 37.3% of the item pairs. Figure 10
shows the values of the $Z_{LI}$ statistic for the item pairs. The figure was created using the R
package corrplot (Wei & Simko, 2017). The item numbers are shown at the left and the top
of the figure. A larger black or white square for a pair of items indicates a $Z_{LI}$ for that pair
that is large in absolute value. Black and white squares indicate statistically significant (at the
5% significance level) positive and negative values, respectively. If $Z_{LI}$ for an item pair
is not statistically significant at the 5% level, then no square is drawn for that pair (and the
background for that item pair remains gray). The several large black squares close to the
diagonal indicate that the response times for the item pairs within each block (of 10 items)
are more correlated than what is expected under the LNMRT. In particular, there are groups
of black squares in the top left corner and the bottom right corner of the figure. There are
several large white squares indicating, for example, that the response times for an item pair
with one item among items 21-28 and another among items 11-20 are less correlated than
what is expected under the LNMRT.
Figure 10: A Plot of the values of the $Z_{LI}$ statistic for the Mathematics data.


Example 2: A Licensure Test 


Two data sets from a licensure test were analyzed in several chapters of Cizek and
Wollack (2017). We consider one of these data sets, which includes item scores and
response times of 1,629 examinees on one test form with 170 operational items that are
dichotomously scored. Sinharay and Johnson (2019) fitted a joint model that includes
the two-parameter logistic IRT model and the LNMRT to this data set to detect item
preknowledge. Figure 11 shows the total time in minutes versus the raw score on the
test. There is a negative correlation (of -0.25) between the total time and the raw score
on the test and, unlike for the mathematics data, the quickest examinees (say, those who
took less than 100 minutes) obtained large raw scores. About 10 examinees seem to have
spent considerably more time than the rest, but information on why that happened was
unavailable to the authors; accommodation is a possible explanation.¹³

¹³A model-fit analysis after removing these examinees does not lead to much difference in the results, so
these examinees are included in the analyses.

Figure 11: Plot of total time versus raw score for the Licensure data.

The MLEs of the $\alpha_j$'s were between 1.37 and 2.70 and those of the $\beta_j$'s were between
2.81 and 4.87. The Shapiro-Wilk test statistic (Shapiro & Wilk, 1965) and the
Nikulin-Rao-Robson $\chi^2$ statistic (Nikulin, 1973; Rao & Robson, 1974) were computed for all
the items in the data set. The number of groups used was 30 for the latter test. At the 5% level
of significance, 73.5% and 67.1% of the items were found to have statistically significant
values of the Shapiro-Wilk statistic and the Nikulin-Rao-Robson statistic, respectively. At the
1% level, the percentages were 67.1% and 53.5%, respectively. The $S - \chi^2$ statistic (Orlando
& Thissen, 2000) was significant for only 7.6% of the items for the data set, and the statistic had a
slightly negative relationship with the Nikulin-Rao-Robson statistic and the Shapiro-Wilk
statistic. In addition, unlike for the mathematics test, the correlation between the person-fit
statistic of Sinharay (2018) and average response time is -0.02, which indicates that the
model misfit in the data cannot be explained by quick responding. The $Z_{LI}$ statistic was
significant at the 5% level for 30.9% of the item pairs.

Figure 12 shows the normal probability plots of the log-response times for nine
items (randomly chosen) among the 170 items. The figure shows some signs of departure
of the log-response times from a normal distribution, but the extent of departure is
considerably less than that in Figure 7.

Figure 12: Normal probability plots for nine items from the Licensure data.
Figure 13: The Density of the Response Times for Two Items.

Figure 13 shows the density of the response times versus that of the best-fitting
lognormal distribution for two items represented in Figure 12: Item 33 (Column 2 of the
top row in Figure 12) and Item 164 (Column 3 of the middle row in Figure 12). These two
items were picked because the model misfit appears not severe for the former and severe
for the latter. While the density of the response times is close to that of the lognormal
distribution for Item 33 (left panel), there is a substantial gap between the two curves for
Item 164 (right panel). Several individuals take longer than what is expected from the
LNMRT for the latter item,¹⁴ for which the mean and standard deviation of the response
time are about 35 and 31, respectively.
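A comparison of this kind can be sketched in R as follows. The sketch assumes that "best-fitting" means the lognormal distribution with maximum likelihood estimates of its two parameters, and it uses made-up response times; the object names are hypothetical and the real data are not reproduced here.

```r
# Sketch: empirical density of response times versus a fitted lognormal density.
# The response times below are MADE UP for illustration.
set.seed(5)
times <- rlnorm(1629, meanlog = 3.2, sdlog = 0.7)

# Maximum likelihood estimates of the lognormal parameters (ASSUMED fitting method)
meanlog_hat <- mean(log(times))
sdlog_hat <- sqrt(mean((log(times) - meanlog_hat)^2))

grid <- seq(min(times), max(times), length.out = 200)
fitted_density <- dlnorm(grid, meanlog = meanlog_hat, sdlog = sdlog_hat)

plot(density(times), main = "Response times versus fitted lognormal")
lines(grid, fitted_density, lty = 2)   # dashed curve: fitted lognormal density
```

A visible gap between the solid and dashed curves, as described for Item 164, points to misfit of the lognormal distribution for that item.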


Conclusions and Recommendations 


This paper focuses on the LNMRT and suggests the use of several statistics for assessing
item fit and local independence of the LNMRT. A simulation study demonstrates that the
suggested statistics have satisfactory Type I error rates and power, especially when compared
to the existing fit statistics. In general, the Shapiro-Wilk statistic and the $Z_{LI}$ statistic had
the largest power to detect item misfit and violation of local independence, respectively.
Two real data applications demonstrate the usefulness of the statistics. Computer code for
computing the new fit statistics is also provided.

The item-fit statistics are based on the classical tests for normality (e.g., Thode, 2002) 
and the Nikulin-Rao-Robson (Nikulin, 1973) test. The test for local independence is based 
on standardized residuals in structural equation models. The asymptotic null distributions 
of all the suggested statistics are known and/or tables of critical values for the test statistics 
are publicly available. The simplicity of the suggested methodologies and their strong 
theoretical basis (in the form of asymptotic null distributions) promise to make them 
attractive to those interested in assessing the goodness of fit of response-time models. 

The LNMRT was found to offer inadequate fit to two real data sets: one from a
grade 8 mathematics test and one from a licensure test. The percentages of statistically
significant values of the item-fit statistics and $Z_{LI}$ were much larger than the nominal
level for both of these data sets. Similar poor fit was observed for two other real data sets
from tests that have a time limit (results not discussed and can be obtained from the lead
author). This result of poor fit of the LNMRT for multiple real data sets is important given
the observation of Bolsinova and Tijmstra (2018, p. 13) that the LNMRT is used in most
applications of response-time modeling.

¹⁴This is most clear for values of time about 100.

Researchers such as Gelman et al. (2014, p. 151) noted that finding an extreme p-value
and thus rejecting a model is never the end of an analysis. Therefore, a natural question in
the context of this paper is “What should a practitioner do when a misfit of the LNMRT is
found?” Several courses of action are possible. First, as Gelman et al. (2014,
p. 151) stated, one can look for other models, including extensions of the current model,
that may improve the fit. In our context, the Box-Cox normal model for the response times
is a possible extension of the LNMRT that was found to fit response-time data better
by Klein Entink et al. (2009); the extension of the LNMRT suggested by Bolsinova and
Tijmstra (2018) is another possible candidate. Second, given that the data examples show
the presence of some outlying/aberrant examinees, a simple extension may not fit the data
and one may need to fit a mixture response-time model (that assumes one model for normal
responses and another model for aberrant responses) such as that of Wang, Xu, Shang, and
Kuncel (2018). Third, as noted by researchers such as Sinharay and Haberman (2014),
practitioners should assess the practical significance of any model misfit; such an assessment
aims to answer questions such as “Are the main inferences made from the model influenced
by the model misfit?” and “Can the model, with its misfit, still be used for the present
problem?” The assessment of practical significance is problem-specific and depends heavily
on the purpose for which the model is being used. An example of such an analysis would
be that in an application of the LNMRT to person-fit analysis using response times (as
in, for example, Sinharay, 2018), one finds 10% misfitting examinees, but then applies
the Box-Cox normal model (Klein Entink et al., 2009) to find the percent of misfitting
examinees; if only 5% of the examinees are found misfitting with the Box-Cox normal model, then
the misfit of the LNMRT is practically significant.


Our paper has several limitations. First, applications of the suggested statistics to
more simulated and real data would provide more insight into these statistics. Second,
extension of the suggested statistics to more complicated response-time models is a possible
topic for future research. Third, other types of item-fit statistics for these models may be
helpful. For example, the suggested statistics have low power under certain conditions and
research on finding more powerful statistics will be useful. Finally, research on finding
effect sizes corresponding to the suggested statistics would be useful. One way to examine
effect size would be to find out the practical significance of the misfit (Sinharay &
Haberman, 2014).
Appendix A: R Functions to Compute the MLEs and the New Statistics


# Use of the R package 'lavaan' to fit the LNMRT 
library(lavaan) 
dy <- data.frame(y) # Data set (of dimension I x J) with log-response times 
J <- 10 # J, the number of items, is assumed to be 10 for the data set dy 
model <- paste("f1=~", paste0("a*X", 1:(J-1), "+", collapse=""), paste("a*X", J, sep="")) 
fit <- cfa(model, data = dy, meanstructure = TRUE, auto.var = TRUE) 
pars <- coef(fit) # pars[12:21], sqrt(1/pars[1:10]), and pars[11] 
# include the estimated beta's, alpha's, and sigma_squared 
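
For readers who want to try the code above without real data, the following sketch generates a matrix y of log-response times. It assumes the usual parameterization of the lognormal model, log T_ij = beta_j - tau_i + e_ij with e_ij ~ N(0, 1/alpha_j^2) and tau_i ~ N(0, sigma_squared). The function name simLNMRT and all parameter values are hypothetical and chosen only for illustration.

```r
# Simulate log-response times under the lognormal model (illustrative values only).
simLNMRT <- function(I, alpha, beta, sigma2) {
  J <- length(beta)
  tau <- rnorm(I, mean = 0, sd = sqrt(sigma2))           # examinee speed parameters
  err <- matrix(rnorm(I * J), I, J) %*% diag(1 / alpha)  # column j has variance 1/alpha_j^2
  # log T_ij = beta_j - tau_i + e_ij (tau is recycled down each column)
  matrix(beta, I, J, byrow = TRUE) - tau + err
}

set.seed(1)
J <- 10
alpha <- runif(J, 1.5, 2.5)  # hypothetical time-discrimination parameters
beta <- rnorm(J, 4, 0.5)     # hypothetical time-intensity parameters
y <- simLNMRT(I = 1000, alpha = alpha, beta = beta, sigma2 = 0.25)
dim(y)
```

Because data.frame(y) names the columns of an unnamed matrix X1, ..., XJ, the simulated y can be passed directly to the lavaan code above.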

# Compute p-values for the Shapiro-Wilk test 
ShapWilk <- function(ltimes) { 
  nitem <- ncol(ltimes) 
  SW <- rep(0, nitem) 
  for (i in 1:nitem) { 
    times <- ltimes[, i] 
    SW[i] <- shapiro.test(times)$p.value 
  } 
  return(SW) 
} 
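
A brief usage sketch for the item-level Shapiro-Wilk test follows. The helper swPvalues is a hypothetical compact equivalent of ShapWilk, and the simulated data, in which the last item is deliberately generated from a non-lognormal distribution, are assumptions made for illustration. Note that shapiro.test accepts between 3 and 5,000 observations.

```r
# Per-item Shapiro-Wilk p-values; same intent as ShapWilk().
swPvalues <- function(ltimes) {
  apply(ltimes, 2, function(x) shapiro.test(x)$p.value)
}

set.seed(2)
n <- 500
ltimes <- matrix(rnorm(n * 5, mean = 4, sd = 0.5), n, 5)  # all five items normal at first
ltimes[, 5] <- log(rexp(n, rate = 1 / 60))                # item 5: exponential times (misfit)
pv <- swPvalues(ltimes)
which(pv < 0.05)  # items flagged at the 5% level
```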
# Compute p-values for the Anderson-Darling test 
library(nortest) 
AnDa <- function(ltimes) { # Anderson-Darling test 
  nitem <- ncol(ltimes) 
  ad <- rep(0, nitem) 
  for (i in 1:nitem) { 
    times <- ltimes[, i] 
    ad[i] <- ad.test(times)$p.value 
  } 
  return(ad) 
} 
# Compute the Nikulin-Rao-Robson Item-fit Statistic 
NRR <- function(ltimes, G) { # G is the number of groups 
  nitem <- ncol(ltimes) 
  n <- nrow(ltimes) 
  eo <- EpsOmegas(G) 
  eps <- eo[, 1] 
  omega <- eo[, 2] 
  chisq <- rep(0, nitem) 
  p <- (1:(G-1))/G 
  for (i in 1:nitem) { 
    times <- ltimes[, i] 
    q <- mean(times) + sd(times)*qnorm(p) 
    z <- c(-100, q, 100) 
    obs <- rep(0, G) 
    for (j in 1:G) { 
      obs[j] <- length(times[times > z[j] & times < z[j+1]]) 
      chisq[i] <- chisq[i] + obs[j]^2 
    } 
    chisq[i] <- G*chisq[i]/n - n + sum(obs*eps)^2/n + sum(obs*omega)^2/n 
  } 
  return(chisq) 
} 
#A function that calculates some constants---these are input to the function NRR 
EpsOmegas <- function(G) { 
  y <- c(-1000, qnorm((1:(G-1))/G), 1000) 
  p1 <- dnorm(y[1:G]) 
  p2 <- dnorm(y[2:(G+1)]) 
  a <- p2 - p1 
  b <- y[1:G]*p1 - y[2:(G+1)]*p2 
  lam1 <- 1 - G*sum(a*a) 
  lam2 <- 2 - G*sum(b*b) 
  return(cbind(G*a/sqrt(lam1), G*b/sqrt(lam2))) 
} 
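
The NRR function returns the statistic only. The sketch below shows one way to obtain a p-value for a single item, assuming the usual large-sample reference distribution of the Nikulin-Rao-Robson statistic (chi-square with G - 1 degrees of freedom). The function nrrItem is a hypothetical, vectorized restatement of the steps in NRR and EpsOmegas that uses infinite outer boundaries instead of large finite constants; the simulated data are for illustration only.

```r
# Nikulin-Rao-Robson statistic and p-value for one item, G equiprobable groups.
nrrItem <- function(x, G) {
  n <- length(x)
  y <- c(-Inf, qnorm((1:(G - 1)) / G), Inf)  # standardized group boundaries
  d <- dnorm(y)                              # normal density at the boundaries
  yd <- ifelse(is.finite(y), y * d, 0)       # y * density, set to 0 at +/-Inf
  a <- diff(d)                               # p2 - p1 in EpsOmegas()
  b <- -diff(yd)                             # y1 * p1 - y2 * p2 in EpsOmegas()
  eps <- G * a / sqrt(1 - G * sum(a^2))
  omega <- G * b / sqrt(2 - G * sum(b^2))
  cuts <- mean(x) + sd(x) * y                # boundaries on the data scale
  obs <- as.vector(table(cut(x, cuts)))      # observed group counts
  stat <- G * sum(obs^2) / n - n + sum(obs * eps)^2 / n + sum(obs * omega)^2 / n
  c(statistic = stat,
    p.value = pchisq(stat, df = G - 1, lower.tail = FALSE))
}

set.seed(3)
fitOK <- nrrItem(rnorm(1000, 4, 0.5), G = 10)       # normal log-times
fitBad <- nrrItem(log(rexp(1000, 1 / 60)), G = 10)  # skewed log-times
rbind(fitOK, fitBad)
```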
# Compute the Z_LI statistic for all item pairs 
TestLI <- function(ltimes) { 
  n <- ncol(ltimes) 
  colnames(ltimes) <- paste("X", 1:n, sep="") 
  mod <- paste("f1=~", paste0("a*X", 1:(n-1), "+", collapse=""), paste("a*X", n, sep="")) 
  fit <- cfa(mod, data = ltimes, meanstructure = TRUE, auto.var = TRUE) 
  lr <- lavResiduals(fit, type = "cor.bollen") 
  Zstat <- lr$cov.z 
  diag(Zstat) <- 0 
  return(Zstat) # Zstat: JxJ matrix consisting of the Z_LI for all item pairs 
} 
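
The matrix returned by TestLI can be screened for locally dependent item pairs. The sketch below assumes that each Z_LI is referred to the standard normal distribution and, as an optional step that is not part of the original code, adjusts the p-values for multiple comparisons with Holm's method. The function flagPairs and the small input matrix are hypothetical and are not results from the paper.

```r
# Flag item pairs from a symmetric matrix of Z statistics with a zero diagonal.
flagPairs <- function(Zstat, level = 0.05, method = "holm") {
  idx <- which(upper.tri(Zstat), arr.ind = TRUE)  # each item pair once
  z <- Zstat[upper.tri(Zstat)]                    # same (column-major) order as idx
  p <- 2 * pnorm(-abs(z))                         # two-sided normal p-values
  out <- data.frame(item1 = idx[, 1], item2 = idx[, 2], z = z,
                    p = p, p.adj = p.adjust(p, method = method))
  out[out$p.adj < level, ]
}

# Hypothetical 4 x 4 matrix of Z statistics, for illustration only
Z <- matrix(0, 4, 4)
Z[1, 2] <- Z[2, 1] <- 5.2
Z[3, 4] <- Z[4, 3] <- -1.1
flagged <- flagPairs(Z)
flagged
```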


Appendix B: Type I Error Rates of the Statistics in the Simulation Study 


Table A1: Type I Error Rates (as percentages) of the Shapiro-Wilk statistic (W), Anderson- 
Darling statistic (A²), Nikulin-Rao-Robson χ² statistic (χ²), Ranger-Kuhn's statistic (T_j), 
the Lagrange multiplier item-fit statistic (LM_j), the Lagrange multiplier local-independence 
statistic (LM_LI), and the Z_LI statistic. 


Statistic        20 Items                40 Items                60 Items 
              500  1,000  5,000      500  1,000  5,000      500  1,000  5,000 
W             4.9    5.1    4.5      5.0    5.3    4.2      5.0    5.3    4.0 
A²            5.2    4.5    4.5      5.2    5.3    4.7      5.1    0.1    4.0 
χ²            5.3    4.9    4.5      4.9    5.2    4.5      4.8    5.1    4.7 
T_j           4.9    4.9    5.0      4.9    5.1    5.0      5.0    4.9    5.0 
LM_j          6.5    6.6    6.5      6.3    6.1    6.6      6.5    6.2    6.9 
LM_LI         6.5    6.3    6.6      6.3    6.5    6.6      6.6    6.8    6.9 
Z_LI            ?      ?    4.5      4.8    5.0    5.0      5.0    4.9    4.9 


Note: The three numbers—500, 1,000, and 5,000—at the top denote the sample sizes. 


Table A1 shows the Type I error rates (at the 5% level of significance) as percentages, rounded 
to the first decimal place, of the item-fit and local-independence statistics. The table shows 
that, except for the LM_j and LM_LI statistics, whose rates are somewhat inflated (6.1% to 
6.9%), all the statistics have satisfactory Type I error rates: the rates are close to, and 
often smaller than, the nominal level in all simulation conditions. The conclusions on Type I 
error rates at the 1% level (not reported here) are very similar. 
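
To make explicit how a Type I error rate of this kind is estimated, the sketch below simulates data that satisfy the model, applies a test, and records the proportion of rejections. It is a simplified single-item illustration for the Shapiro-Wilk test, not the simulation design of the paper; the function typeIError, the number of replications, and the parameter values are all assumptions.

```r
# Monte Carlo estimate of the Type I error rate of the Shapiro-Wilk test
# for one item whose log-response times satisfy the lognormal model.
typeIError <- function(nrep, n, level = 0.05, beta = 4, alpha = 2, sigma2 = 0.25) {
  reject <- replicate(nrep, {
    tau <- rnorm(n, 0, sqrt(sigma2))          # examinee speed parameters
    y <- beta - tau + rnorm(n, 0, 1 / alpha)  # log-times under the model
    shapiro.test(y)$p.value < level           # TRUE if the item is (falsely) rejected
  })
  rate <- mean(reject)
  c(rate.percent = 100 * rate,
    mc.se.percent = 100 * sqrt(rate * (1 - rate) / nrep))  # Monte Carlo standard error
}

set.seed(4)
res <- typeIError(nrep = 2000, n = 500)
res
```

The Monte Carlo standard error indicates how much of the deviation of an estimated rate from the nominal 5% can be attributed to the finite number of replications.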


